1. INTRODUCTION

Road accidents are among the most common causes of death worldwide. As a result, pedestrian
detection algorithms, which locate all pedestrians in an image, have gained popularity in the computer vision and
artificial intelligence communities. Most pedestrian detection systems (PDS) are designed for clear weather
[1]-[9]. However, a practical PDS must also perform well in rainy or snowy conditions.

Rain is one of the most common dynamic weather phenomena. Images taken in rainy conditions
frequently suffer from local degradations [10], [11] such as low visibility and distortion, which directly impair
perceived visual quality and make the images unfit for sharing and use. Furthermore, rain-induced artifacts
may drastically degrade the performance of many machine vision solutions, such as smart driving and video
monitoring systems.

The machine learning (ML) field focuses on the development of computer algorithms that exploit data
to learn patterns, make predictions, and improve their performance as more data becomes available. Recently,
thanks to convolutional neural networks [12]-[15], and in particular the pix2pix [16] network architecture
and the adversarial training strategy, single-image de-raining has made notable progress. By training a
rainy-to-clean image translation model on synthetic rain streak or raindrop datasets, a rainy image can be
effectively repaired by eliminating the artifacts, even when the rain streaks or raindrops vary in scale,
shape, and thickness.

In this paper, we investigate the impact of images taken in rainy conditions on pedestrian detection
using the mAP metric. In addition, we assess the effectiveness of our proposed PDS, based on Pix2Pix
and you only look once (YOLO) v3, in comparison with other models based on noise-removal filters. The rest
of this article is organized as follows: Section 2 describes in detail the algorithms that are employed. Section 3
introduces our proposed pedestrian detection system for handling adverse weather conditions. Section 4
presents and discusses the results of our research. Section 5 concludes the paper.


2. METHODS AND TOOLS 
2.1. YOLO v3 algorithm 

YOLO [17] is an open-source object detection and classification algorithm based on convolutional
neural networks (CNN). It predicts which objects are present in an image and where they are in a single
pass. The primary benefit of this approach is that the whole image is evaluated by a single neural network.
The network can process images in real time at 45 frames per second (FPS) on an Nvidia Titan X, and a simplified
version, Fast YOLO, can reach 155 FPS with better results than most real-time detectors.

YOLO starts detecting objects by dividing the input image into an S×S grid. Each grid cell predicts C
class probabilities and B bounding boxes, each with a confidence score. Each bounding box consists of 5 variables: x,
y, w, h, and a box confidence score. The confidence score represents how likely the bounding box is to contain an
object and how accurate the bounding box is. x and y are offsets relative to the corresponding cell. The bounding box
width w and height h are normalized by the width and height of the image. Each cell has C conditional class
probabilities. The final output of YOLO has a shape of (S, S, B×5 + C). The structure of the YOLO v3
algorithm is presented in Figure 1.

Figure 1. YOLO v3 architecture
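
As a rough illustration of the grid encoding described above, the sketch below computes the output tensor shape and converts one cell-relative prediction into pixel coordinates. The function names and the values S=7, B=2, C=20 and a 448x448 image are illustrative assumptions, not settings taken from this paper.

```python
def yolo_output_shape(S, B, C):
    """Shape of the prediction tensor: B boxes of 5 values plus C class scores per cell."""
    return (S, S, B * 5 + C)


def decode_box(row, col, x, y, w, h, S, img_w, img_h):
    """Convert one cell-relative prediction to absolute pixel units.

    x, y are offsets inside grid cell (row, col); w, h are fractions of the image size.
    Returns (center_x, center_y, width, height) in pixels.
    """
    center_x = (col + x) / S * img_w
    center_y = (row + y) / S * img_h
    return center_x, center_y, w * img_w, h * img_h


# Illustrative values only (assumed, not from this paper).
shape = yolo_output_shape(S=7, B=2, C=20)
box = decode_box(row=3, col=3, x=0.5, y=0.5, w=0.2, h=0.4, S=7, img_w=448, img_h=448)
```

With these assumed values, each of the 49 cells carries 2×5 + 20 = 30 numbers, and a box predicted at the middle of the central cell is centered in the image.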


2.2. Average filter 

The average filter [18] operates by passing through the image pixel by pixel. At each location, the
central element is replaced with the mean of all pixel values under the kernel region. The 3 by 3 and 5 by
5 filters are shown in (1) and (2) respectively.
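
A minimal NumPy sketch of this operation follows. A k×k averaging kernel gives every pixel in the window the weight 1/k². The function name and the edge-replicating border handling are assumptions made for illustration; the text does not specify how borders are treated.

```python
import numpy as np


def average_filter(image, size=3):
    """Replace each pixel with the mean of the size x size window centered on it."""
    img = np.asarray(image, dtype=float)
    pad = size // 2
    # Border handling is an assumption: edge pixels are replicated outward.
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    # Sum the size*size shifted copies of the image, then divide by the window area.
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (size * size)


# A single bright pixel is spread evenly over its 3x3 neighborhood.
spike = np.zeros((5, 5))
spike[2, 2] = 90.0
smoothed = average_filter(spike, size=3)
```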


2.3. Gaussian filter

The Gaussian filter [19] is a linear filter commonly used in image processing for blurring and
removing details and noise from images. Unlike the average filter, its kernel follows the shape of a Gaussian
(bell-shaped) curve. An image I filtered by Gaussian convolution is given by (3), where $\sigma$ is the standard
deviation of the distribution, p denotes the central pixel of the kernel, q ranges over the positions of its
neighbors, and $G_\sigma$ denotes the 2D Gaussian kernel (4).


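$$GC[I]_p = \sum_{q \in S} G_\sigma(\lVert p - q \rVert)\, I_q \tag{3}$$

$$G_\sigma(x) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2}{2\sigma^2}\right) \tag{4}$$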


The Gaussian convolution is independent of the image content: the influence of one pixel on another
depends only on their distance in the image, not on the image values themselves.
The Gaussian filters (kernel 3x3, $\sigma$=0.8) and (kernel 5x5, $\sigma$=1.1) are depicted in (5) and (6), respectively.
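
The sketch below shows one way such kernels could be generated with NumPy. It samples the Gaussian on an integer grid and normalizes the weights so that they sum to 1. This discrete normalization is an assumption of the sketch and may differ slightly from the library routine the authors used.

```python
import numpy as np


def gaussian_kernel(size, sigma):
    """Sample a 2D Gaussian on a size x size integer grid and normalize it to sum to 1."""
    half = size // 2
    axis = np.arange(-half, half + 1)
    xx, yy = np.meshgrid(axis, axis)
    # The constant 1/(2*pi*sigma^2) is dropped because it cancels in the normalization.
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


kernel_3x3 = gaussian_kernel(3, 0.8)
kernel_5x5 = gaussian_kernel(5, 1.1)
```

The weights depend only on the distance to the kernel center, so the kernel is symmetric and its largest weight sits on the central pixel.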


2.4. Median filter

The median filter [20] is a nonlinear filter that replaces the central pixel with the median of the pixels
under the kernel area. The central element is therefore always replaced by one of the pixels under the kernel area,
which is not the case with average and Gaussian filtering. As a result, the median filter is less sensitive to
extreme values (called outliers) than the average filter.
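
A small worked example makes the difference concrete. The nine pixel values below are invented for illustration: a flat 3x3 patch with one bright outlier in the center.

```python
from statistics import mean, median

# Hypothetical 3x3 window, flattened row by row; 255 is the outlier.
window = [12, 11, 13, 12, 255, 12, 11, 13, 12]

mean_value = mean(window)      # pulled far above the typical value by the outlier
median_value = median(window)  # stays at a value that actually occurs in the window
```

Here the mean is 39, a value that appears nowhere in the window, whereas the median is 12.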


2.5. Bilateral filter 

The bilateral filter [21] is a technique for smoothing images and reducing noise without blurring large,
sharp edges. Its definition is similar to that of Gaussian convolution, except that it also takes into account
the difference in value between neighboring pixels. It is denoted by $BF[\cdot]$ and defined as follows:


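$$BF[I]_p = \frac{1}{W_p} \sum_{q \in S} G_{\sigma_s}(\lVert p - q \rVert)\, G_{\sigma_r}(\lvert I_p - I_q \rvert)\, I_q \tag{7}$$

where the normalization factor $W_p$ ensures that the pixel weights sum to 1.0:

$$W_p = \sum_{q \in S} G_{\sigma_s}(\lVert p - q \rVert)\, G_{\sigma_r}(\lvert I_p - I_q \rvert) \tag{8}$$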


The parameters $\sigma_s$ and $\sigma_r$ specify the amount of filtering applied to the image. $G_{\sigma_s}$ is a spatial Gaussian
weighting that decreases the influence of distant pixels, and $G_{\sigma_r}$ is a range Gaussian that decreases the influence
of pixels q whose intensity values differ from $I_p$. Bilateral filters are generated by the bilateralFilter(src, d,
sigmaColor, sigmaSpace) function [22], which accepts the following parameters:

- src: the source image.
- d: the diameter of the pixel neighborhood.
- sigmaColor: the filter sigma in the color space.
- sigmaSpace: the filter sigma in the coordinate space.
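
To make the definition in (7) and (8) concrete, the following is a brute-force NumPy sketch for a grayscale image. It is not the library implementation; the function names and the `radius` parameter (so that the window diameter is 2*radius + 1) are assumptions. In terms of the library parameters, sigmaSpace plays the role of the spatial sigma and sigmaColor the role of the range sigma.

```python
import numpy as np


def gaussian(x, sigma):
    # Unnormalized Gaussian; the constant factor cancels when dividing by W_p.
    return np.exp(-(x ** 2) / (2.0 * sigma ** 2))


def bilateral_filter(image, radius, sigma_s, sigma_r):
    """Brute-force bilateral filter for a 2D grayscale image."""
    img = np.asarray(image, dtype=float)
    height, width = img.shape
    out = np.zeros_like(img)
    for py in range(height):
        for px in range(width):
            y0, y1 = max(0, py - radius), min(height, py + radius + 1)
            x0, x1 = max(0, px - radius), min(width, px + radius + 1)
            window = img[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            spatial = gaussian(np.hypot(yy - py, xx - px), sigma_s)   # distance term
            rng = gaussian(np.abs(window - img[py, px]), sigma_r)     # intensity term
            weights = spatial * rng
            out[py, px] = np.sum(weights * window) / np.sum(weights)  # divide by W_p
    return out


# A vertical step edge: left half 0, right half 100.
step = np.zeros((5, 6))
step[:, 3:] = 100.0
edge_kept = bilateral_filter(step, radius=1, sigma_s=1.0, sigma_r=10.0)
edge_blurred = bilateral_filter(step, radius=1, sigma_s=1.0, sigma_r=1e6)
```

With a small range sigma, pixels across the edge receive almost zero weight and the edge survives. With a very large range sigma, the range term is nearly constant and the filter behaves like a plain Gaussian blur.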


2.6. Non local means filter 

The non-local means method [23] replaces a pixel's value with an average of the values of a selection of
other pixels: small blocks centered on other pixels are compared with the block centered on the pixel of interest,
and the average is computed only over pixels whose blocks are similar to the current block. As a consequence,
this approach can preserve textures that would be blurred by other denoising algorithms. Non-local means
filters are generated by the fastNlMeansDenoisingColored(src, h, hColor, templateWindowSize,
searchWindowSize) function [22], which accepts the following parameters:

- h: controls the filter strength for the luminance component. A larger h removes more noise but also more
  image detail; a smaller h preserves detail but also leaves some noise.
- hColor: the same as h, but for the color components.
- templateWindowSize: the size in pixels of the template block used to compute weights.
- searchWindowSize: the size in pixels of the window used to compute the weighted average for a given pixel.
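
The idea can be sketched on a one-dimensional signal, which keeps the code short. This is a simplified illustration rather than the library routine: the function name, the exponential weighting of the mean squared block difference, and the reflective padding are all assumptions. Here `patch_radius` and `search_radius` play the roles of templateWindowSize and searchWindowSize, and `h` controls the filter strength.

```python
import numpy as np


def nlm_1d(signal, h=0.1, patch_radius=1, search_radius=5):
    """Simplified 1D non-local means: average samples whose surrounding blocks look alike."""
    x = np.asarray(signal, dtype=float)
    n = len(x)
    padded = np.pad(x, patch_radius, mode="reflect")
    # patches[i] is the block centered on sample i.
    patches = np.stack([padded[i:i + 2 * patch_radius + 1] for i in range(n)])
    out = np.empty(n)
    for i in range(n):
        lo, hi = max(0, i - search_radius), min(n, i + search_radius + 1)
        # Mean squared difference between the current block and each candidate block.
        dist2 = np.mean((patches[lo:hi] - patches[i]) ** 2, axis=1)
        weights = np.exp(-dist2 / (h * h))  # similar blocks get weights close to 1
        out[i] = np.sum(weights * x[lo:hi]) / np.sum(weights)
    return out


# A clean step is left intact, while a small ripple around 1.0 is averaged out.
step = np.array([0.0] * 10 + [1.0] * 10)
step_filtered = nlm_1d(step)
ripple = 1.0 + 0.01 * np.where(np.arange(20) % 2 == 0, 1.0, -1.0)
ripple_filtered = nlm_1d(ripple)
```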


2.7. Adversarial network 

Generative adversarial networks (GAN) [16] are a class of machine learning frameworks designed by
Ian Goodfellow et al. A GAN is composed of two parts: i) the generator, which learns to produce plausible data
and whose outputs serve as negative training examples for the discriminator, and ii) the discriminator, which
learns to distinguish the fake data produced by the generator from real data. When the generator produces
implausible results, it is penalized by the discriminator.

When training starts, the generator produces fake data that the discriminator quickly recognizes.
As training progresses, the generator comes closer to generating output that fools the discriminator.
Finally, if generator training is successful, the discriminator becomes less capable of distinguishing
between real and fabricated images. It begins to mistakenly classify fake data as real, and its accuracy decreases
as a result.

Both models are based on neural networks. The discriminator input is connected directly to the generator
output. The discriminator's classification is used by the generator to update its weights via backpropagation.
As a result, both models are trained concurrently in an adversarial process in which the
generator tries to trick the discriminator while the discriminator attempts to spot the fake images. The GAN
framework is presented in Figure 2.


[Diagram: Noise -> Generator -> Generated image; Real image and Generated image -> Discriminator -> Real/Fake; Backpropagation to both networks]

Figure 2. GAN architecture
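
The training dynamics described above can be illustrated with the binary cross-entropy losses commonly used for GANs. The sketch below is a simplified scalar version with assumed function names; the discriminator outputs are invented probabilities, not measurements from our models.

```python
import math


def bce(label, prob, eps=1e-7):
    """Binary cross-entropy for one prediction, with clipping for numerical safety."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_fake):
    # The discriminator wants real images scored as 1 and generated images as 0.
    return bce(1, d_real) + bce(0, d_fake)


def generator_loss(d_fake):
    # The generator wants its images scored as 1 ("real") by the discriminator.
    return bce(1, d_fake)


# Early training (assumed scores): fakes are easy to spot.
early_d, early_g = discriminator_loss(0.9, 0.1), generator_loss(0.1)
# Late training (assumed scores): the discriminator can only guess.
late_d, late_g = discriminator_loss(0.5, 0.5), generator_loss(0.5)
```

As the generator improves, its loss falls while the discriminator's loss rises toward 2 ln 2, the value reached when it assigns probability 0.5 to every image.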


3. PROPOSED METHOD 

To remove the effect of rain from images, we translate a rainy image directly into an un-rainy
image, pixel by pixel. This study was inspired by the effectiveness of Pix2Pix GANs in translating one
image into another. The Pix2Pix technique uses a conditional GAN (cGAN), in which the output image is
generated conditioned on an input, in our case a source image.

The discriminator is an image classification model built on a deep CNN. Specifically, it
performs conditional image classification: it takes both the source image (e.g., a rainy image) and the target
image (e.g., an un-rainy image) as input and predicts the probability that the target image is a real translation
of the source image rather than a generated one. It follows the PatchGAN design, which is defined in terms of
the model's effective receptive field, that is, the relationship between one output of the model and the number
of pixels in the input image that influence it. The model is designed so that each output prediction maps to a
70x70 block of the input image. The advantage of this design is that the same model can be applied to images of
different sizes, such as those larger or smaller than 256x256 pixels. The two input images are concatenated
channel-wise, and the model outputs a patch of predictions. The model is optimized with log loss (binary
cross-entropy), and the loss is weighted by a factor of 0.5, as recommended by the Pix2Pix authors. This
weighting slows down changes to the discriminator model compared to the generator model, which helps improve
the overall training process. The flowchart of our proposed discriminator is presented in Figure 3.
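
The receptive field of a stack of convolutions can be computed by walking backward through the layers. The sketch below applies this to the layer configuration usually given for the 70x70 PatchGAN (4x4 kernels, three stride-2 layers followed by two stride-1 layers); that configuration is an assumption taken from the Pix2Pix literature, and the function name is illustrative.

```python
def receptive_field(layers):
    """Receptive field of one output unit for a stack of (kernel, stride) conv layers."""
    field = 1
    # Walk from the output back to the input: each layer widens the field.
    for kernel, stride in reversed(layers):
        field = (field - 1) * stride + kernel
    return field


# Assumed 70x70 PatchGAN stack: C64, C128, C256 (stride 2), C512 and output (stride 1).
patchgan_70 = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]
rf = receptive_field(patchgan_70)

# Same formula with one additional stride-2 layer at the front.
rf_deeper = receptive_field([(4, 2)] + patchgan_70)
```

Under this formula the assumed stack yields 70, whereas a stack with four stride-2 layers followed by two stride-1 layers, as the listing in Figure 3 appears to use, would yield 142. It may therefore be worth checking which configuration was actually trained.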

In comparison to the discriminator, the generator is more complicated. The generator employs a 
U-Net architecture as an encoder-decoder model. It generates a target image (unrainy image) from a source 
image (rainy image). To achieve this, the input image is first downscaled or encoded to a bottleneck layer, and 
then the condensed representation is upscaled or decoded to the output image size. Figures 4 to 6 depict the 
flowcharts for the encoder, decoder, and generator respectively. 
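
The downscale-then-upscale structure can be traced with simple arithmetic. The sketch below assumes a 256x256 input, seven stride-2 encoder blocks followed by a stride-2 bottleneck convolution, and 'same' padding, so that each step halves the spatial size; the function name is illustrative.

```python
def unet_feature_sizes(input_size=256, n_encoder_blocks=7):
    """Spatial size after each stride-2 step: input, encoder blocks, then the bottleneck."""
    sizes = [input_size]
    for _ in range(n_encoder_blocks + 1):  # +1 for the bottleneck convolution
        sizes.append(sizes[-1] // 2)
    return sizes


sizes = unet_feature_sizes()
encoder_sizes = sizes[1:-1]          # outputs of the seven encoder blocks
bottleneck_size = sizes[-1]
# Each decoder block doubles the size and is paired with the encoder output of
# the same size (the skip connection), so the decoder mirrors the encoder.
decoder_sizes = encoder_sizes[::-1]
```

Under these assumptions the bottleneck is a 1x1 feature map, and a final stride-2 upsampling after the last decoder block restores the 256x256 output.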

    # The function header and the init line are cut off in the source figure;
    # the name define_discriminator and the initializer are assumed here.
    from tensorflow.keras.initializers import RandomNormal
    from tensorflow.keras.layers import (Activation, BatchNormalization,
                                         Concatenate, Conv2D, Input, LeakyReLU)
    from tensorflow.keras.models import Model
    from tensorflow.keras.optimizers import Adam

    def define_discriminator(image_shape):
        init = RandomNormal(stddev=0.02)
        # source image input
        in_src_image = Input(shape=image_shape)
        # target image input
        in_target_image = Input(shape=image_shape)
        # concatenate images channel-wise
        merged = Concatenate()([in_src_image, in_target_image])
        # C64
        d = Conv2D(64, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(merged)
        d = LeakyReLU(alpha=0.2)(d)
        # C128
        d = Conv2D(128, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(d)
        d = BatchNormalization()(d)
        d = LeakyReLU(alpha=0.2)(d)
        # C256
        d = Conv2D(256, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(d)
        d = BatchNormalization()(d)
        d = LeakyReLU(alpha=0.2)(d)
        # C512
        d = Conv2D(512, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(d)
        d = BatchNormalization()(d)
        d = LeakyReLU(alpha=0.2)(d)
        # second last output layer
        d = Conv2D(512, (4,4), padding='same', kernel_initializer=init)(d)
        d = BatchNormalization()(d)
        d = LeakyReLU(alpha=0.2)(d)
        # patch output
        d = Conv2D(1, (4,4), padding='same', kernel_initializer=init)(d)
        patch_out = Activation('sigmoid')(d)
        # define model
        model = Model([in_src_image, in_target_image], patch_out)
        # compile model
        opt = Adam(learning_rate=0.0002, beta_1=0.5)
        model.compile(loss='binary_crossentropy', optimizer=opt, loss_weights=[0.5])
        return model

Figure 3. The flowchart of the discriminator algorithm



The discriminator model is trained directly on both real and generated images, whereas the generator
model is not. Instead, the generator is trained via the discriminator: it is updated to minimize the loss that the
discriminator predicts for generated images labeled as "real". In this way, it is encouraged to produce more
realistic images. The discriminator weights are reused in this composite model, but they are marked as
untrainable because the discriminator is updated separately. The composite model is updated with two targets:
one indicating that the generated images are real (cross-entropy loss), which forces the generator to make large
weight updates toward more realistic images, and the actual real translation of the image, which is compared
to the generator model's output (L1 loss).
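
The two targets of the composite model can be written as a single weighted sum. The NumPy sketch below is an illustration under assumptions: the function name is hypothetical, and the L1 weight of 100 follows the value commonly used with Pix2Pix rather than a setting reported in this paper.

```python
import numpy as np


def composite_generator_loss(d_patch_probs, generated, target, l1_weight=100.0):
    """Adversarial cross-entropy (against all-"real" labels) plus weighted L1 loss."""
    probs = np.clip(np.asarray(d_patch_probs, dtype=float), 1e-7, 1.0 - 1e-7)
    # Cross-entropy when every patch is labeled "real": -log(D(x, G(x))).
    adversarial = float(np.mean(-np.log(probs)))
    # Mean absolute error between the generated and the real un-rainy image.
    l1 = float(np.mean(np.abs(np.asarray(generated, dtype=float)
                              - np.asarray(target, dtype=float))))
    return adversarial + l1_weight * l1, adversarial, l1


# Toy inputs (assumed): an undecided discriminator and a constant 0.1 pixel error.
total, adversarial, l1 = composite_generator_loss(
    d_patch_probs=np.full((16, 16), 0.5),
    generated=np.zeros((4, 4, 3)),
    target=np.full((4, 4, 3), 0.1),
)
```

With a large L1 weight, even a small pixel-wise error dominates the sum, which pushes the generator toward outputs close to the real translation.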

GAN models typically do not converge. Instead, an equilibrium is established between the generator and
discriminator models. As a result, it is difficult to decide when to stop training. During training, we can save
the model regularly and use it to generate sample image-to-image translations. For example, after every 10 training
epochs, we examine the generated images and choose a final model based on the image quality. Figure 7 depicts
the Pix2Pix cGAN flowchart.

    from tensorflow.keras.initializers import RandomNormal
    from tensorflow.keras.layers import BatchNormalization, Conv2D, LeakyReLU

    # define an encoder block
    def define_encoder_block(layer_in, n_filters, batchnorm=True):
        # weight initialization
        init = RandomNormal(stddev=0.02)
        # add downsampling layer
        g = Conv2D(n_filters, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(layer_in)
        # conditionally add batch normalization
        if batchnorm:
            g = BatchNormalization()(g, training=True)
        # leaky relu activation
        g = LeakyReLU(alpha=0.2)(g)
        return g

Figure 4. The flowchart of the encoder model



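    from tensorflow.keras.initializers import RandomNormal
    from tensorflow.keras.layers import Activation, BatchNormalization, Concatenate, Conv2DTranspose, Dropout

    # define a decoder block
    def decoder_block(layer_in, skip_in, n_filters, dropout=True):
        # weight initialization
        init = RandomNormal(stddev=0.02)
        # add upsampling layer
        g = Conv2DTranspose(n_filters, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(layer_in)
        # add batch normalization
        g = BatchNormalization()(g, training=True)
        # The source figure is cut off here; the steps below are assumed from the signature.
        if dropout:
            g = Dropout(0.5)(g, training=True)  # dropout rate is an assumption
        g = Concatenate()([g, skip_in])  # merge with the skip connection
        g = Activation('relu')(g)
        return g

Figure 5. The flowchart of the decoder model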


In this study, we propose a pedestrian detection system capable of detecting pedestrians in two
scenarios, rainy and un-rainy, as presented in Figure 8. The first step of our system is to determine whether or
not it is raining in the input image, using our proposed rainy detector based on deep convolutional neural
networks, as described in Figure 9. If the image is rainy, we use the generator from our generative adversarial
network to transform the rainy image into an un-rainy image before passing it to YOLO v3; otherwise, the image
passes directly to YOLO v3, which detects the positions of pedestrians in the image. Moreover, to demonstrate
the effectiveness of our proposed PDS model against weather degradation, we compared it with multiple PDS
models based on noise-removal filters, presented in Figures 10 to 14.
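
The routing logic of the system can be summarized in a few lines. In the sketch below the three components are passed in as plain callables; the function names and the toy stand-ins are hypothetical and only illustrate the control flow, not the trained models.

```python
def detect_pedestrians(image, is_rainy, derain, detect):
    """Route an image through de-raining only when the rainy detector fires."""
    if is_rainy(image):
        image = derain(image)
    return detect(image)


# Toy stand-ins (hypothetical) that record which stages were called.
calls = []

def fake_is_rainy(image):
    return image["rain"]

def fake_derain(image):
    calls.append("derain")
    return {**image, "rain": False}

def fake_detect(image):
    calls.append("detect")
    return [] if image["rain"] else ["pedestrian"]


rainy_result = detect_pedestrians({"rain": True}, fake_is_rainy, fake_derain, fake_detect)
clear_result = detect_pedestrians({"rain": False}, fake_is_rainy, fake_derain, fake_detect)
```

The same skeleton covers the filter-based baselines: only the `derain` callable changes, for example to an average or bilateral filter.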


    from tensorflow.keras.initializers import RandomNormal
    from tensorflow.keras.layers import Activation, Conv2D, Conv2DTranspose, Input
    from tensorflow.keras.models import Model

    # define the standalone generator model
    # (uses define_encoder_block and decoder_block from Figures 4 and 5)
    def define_generator(image_shape=(256,256,3)):
        # weight initialization
        init = RandomNormal(stddev=0.02)
        # image input
        in_image = Input(shape=image_shape)
        # encoder model
        e1 = define_encoder_block(in_image, 64, batchnorm=False)
        e2 = define_encoder_block(e1, 128)
        e3 = define_encoder_block(e2, 256)
        e4 = define_encoder_block(e3, 512)
        e5 = define_encoder_block(e4, 512)
        e6 = define_encoder_block(e5, 512)
        e7 = define_encoder_block(e6, 512)
        # bottleneck, no batch norm and relu
        b = Conv2D(512, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(e7)
        b = Activation('relu')(b)
        # decoder model
        d1 = decoder_block(b, e7, 512)
        d2 = decoder_block(d1, e6, 512)
        d3 = decoder_block(d2, e5, 512)
        d4 = decoder_block(d3, e4, 512, dropout=False)
        d5 = decoder_block(d4, e3, 256, dropout=False)
        d6 = decoder_block(d5, e2, 128, dropout=False)
        d7 = decoder_block(d6, e1, 64, dropout=False)
        # output
        g = Conv2DTranspose(3, (4,4), strides=(2,2), padding='same', kernel_initializer=init)(d7)
        out_image = Activation('tanh')(g)
        # define model
        model = Model(in_image, out_image)
        return model

Figure 6. The flowchart of the generator model

Figure 7. The flowchart of the Pix2Pix cGAN model

Figure 8. Proposed PDS based on GAN

Figure 9. Rainy Detector Architecture

Figure 10. PDS based on average filter

Figure 11. PDS based on Gaussian filter

Figure 12. PDS based on median filter

Figure 13. PDS based on bilateral filter

Figure 14. PDS based on non-local means filter


4. EXPERIMENTAL RESULTS & DISCUSSION 

To build our GAN model, we downsized the images of the VOC2014 dataset [24] to 256x256
and divided them into two folders, one for training and one for testing. The training folder contains 1,000 images
numbered from 000001 to 001979, whereas the testing folder contains 3,952 images numbered from 001983 to 009963.
Each sample is a pair of images, with the rainy image on the left and the un-rainy image on the right. Rainy images
are generated using the add_rain function from the Automold source code (add_rain(clean, slant=-20, drop_length=20,
drop_width=1, rain_type='heavy')). Additionally, we trained our proposed rainy detector CNN using 600
images (from 000001 to 001190) for training and 400 images (from 001193 to 001979) for validation.
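
A paired sample of this kind can be separated into its two halves before training. The NumPy sketch below assumes that each pair is stored as one 256x512 RGB array (rainy on the left, un-rainy on the right) and that pixel values are rescaled to [-1, 1] to match a tanh generator output; both the storage layout and the helper names are assumptions.

```python
import numpy as np


def split_pair(pair):
    """Split a side-by-side pair into (left, right) halves along the width axis."""
    half = pair.shape[1] // 2
    return pair[:, :half], pair[:, half:]


def to_model_range(image):
    """Rescale 8-bit pixel values from [0, 255] to [-1, 1]."""
    return (image.astype("float32") - 127.5) / 127.5


# Synthetic stand-in for one stored pair: black left half, white right half.
pair = np.zeros((256, 512, 3), dtype=np.uint8)
pair[:, 256:] = 255
rainy, clean = split_pair(pair)
rainy_scaled, clean_scaled = to_model_range(rainy), to_model_range(clean)
```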

All modules were implemented on Ubuntu 20.04 LTS with Python 3.7.13. The deep-learning networks
were implemented using the TensorFlow framework (version 2.6). YOLOv3 weights were
initialized from the “yolov3.weights” COCO pre-trained model (416x416). All experiments were
performed on an Nvidia 1650 GPU (4 GB), an Intel(R) Core(TM) i5-9400F CPU at 2.90 GHz (6 cores), and
16 GB of RAM.

In our work we:

- Load and prepare the rain-affected images from the original image dataset.
- Develop a rainy detector model to determine whether or not it is raining in the input images.
- Build a Pix2Pix model to transform rainy images into un-rainy images.
- Use the final Pix2Pix generator model to transform rainy images into un-rainy images.
- Use the pre-trained YOLO v3 model to detect pedestrians in images.

The average precision (AP) metric, which measures the area under the precision-recall curve, was
used to evaluate our models. It is a widely used metric for evaluating the accuracy of object detectors. We
determined the average precision in this work using the Cartucho source code [25]. The results presented in
Table 1 show that the average, Gaussian, median, bilateral, and non-local means filters do not help on images with
raindrops, and instead make the detection result worse than YOLOv3 alone. In contrast, our proposed system
succeeded in restoring rainy images to un-rainy images and achieved better pedestrian detection performance on
rainy images (46.10% AP versus 26.5% for YOLOv3 alone). The main limitation of our work is speed: our system
reaches about 10 FPS for non-rainy images and 6 FPS for rainy images due to hardware and software limitations.
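
As an illustration of the metric, the sketch below computes the area under a precision-recall curve for a ranked list of detections using the all-point interpolation common in PASCAL VOC style evaluation. Whether the cited tool computes AP in exactly this way is an assumption, and the function name and toy detections are hypothetical.

```python
def average_precision(is_true_positive, n_ground_truth):
    """AP for detections already sorted by descending confidence.

    is_true_positive: list of booleans, one per detection.
    n_ground_truth: number of ground-truth objects of the class.
    """
    true_pos = false_pos = 0
    recalls, precisions = [], []
    for hit in is_true_positive:
        if hit:
            true_pos += 1
        else:
            false_pos += 1
        recalls.append(true_pos / n_ground_truth)
        precisions.append(true_pos / (true_pos + false_pos))
    # Make precision non-increasing from right to left (the interpolation envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the interpolated curve.
    ap, previous_recall = 0.0, 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


# Toy ranking: hit, miss, hit, with two pedestrians in the ground truth.
ap_toy = average_precision([True, False, True], n_ground_truth=2)
ap_perfect = average_precision([True, True], n_ground_truth=2)
```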


Table 1. Performance of proposed pedestrian detection systems

                                            VOC2014            Rainy VOC2014
Model                                       AP (%)  FPS (Hz)   AP (%)  FPS (Hz)
YOLOv3                                      74.38   14         26.5    14
Average filter(3,3) & YOLOv3                73.84   10         23.34   10
Gaussian filter(3,3) & YOLOv3               73.88   10         24.32   10
Median filter(3,3) & YOLOv3                 73.89   10         25.2    10
Bilateral filter(3,10,10) & YOLOv3          73.95   10         26.09   10
Non Local Means Filter(3,3,3,3) & YOLOv3    73.94   10         25.47   9
Pix2Pix GAN rainy removal & YOLOv3          73.74   10         46.10   6


5. CONCLUSION 

Advanced driver assistance systems are becoming more capable as in-vehicle infrastructure improves.
However, on rainy days, the detection rate remains poor. Rain streaks accumulate and obstruct the camera's
view. In addition, most pedestrians wear raincoats or hold umbrellas on rainy days, resulting in a high number
of occlusions. Because of the difficulty of detecting pedestrians in the rain, this study proposed a new PDS that
includes a de-raining subsystem to detect pedestrians in both rainy and non-rainy conditions. On rainy images,
our proposed PDS outperforms both plain YOLOv3 and the pipelines based on traditional noise-removal filters.
Developing a neural network that excels in one area but fails in others is not a viable strategy for self-driving
vehicles. Our long-term goal is to develop deep-learning architectures and solutions that can detect objects in
a variety of environments. We are also interested in using thermal imaging cameras to detect pedestrians
because of their ability to see in complete darkness, light fog, light rain, and snow.